Phi 4 Multimodal Instruct
MIT
Phi-4-multimodal-instruct is a lightweight open-source multimodal foundation model that supports text, image, and audio inputs to generate text outputs, with a context length of 128K tokens.
Multimodal Fusion
Transformers Supports Multiple Languages